Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
-
Li, Yingzhen; Mandt, Stephan; Agrawal, Shipra; Khan, Emtiyaz (Ed.)We study the problem of causal effect estimation in the presence of unobserved confounders, focusing on two settings: instrumental variable (IV) regression with additional observed confounders, and proxy causal learning. Our approach uses a singular value decomposition of a conditional expectation operator combined with a saddle-point optimization method. In the IV regression setting, this can be viewed as a neural network generalization of the seminal approach due to Darolles et al. (2011). Saddle-point formulations have recently gained attention because they mitigate the double-sampling bias and are compatible with modern function approximation methods. We provide experimental validation across various settings and show that our approach outperforms existing methods on common benchmarks.more » « lessFree, publicly-accessible full text available May 3, 2026
-
Statistical and Computational Complexities of BFGS Quasi-Newton Method for Generalized Linear ModelsThe gradient descent (GD) method has been used widely to solve parameter estimation in generalized linear models (GLMs), a generalization of linear models when the link function can be non-linear. In GLMs with a polynomial link function, it has been shown that in the high signal-to-noise ratio (SNR) regime, due to the problem's strong convexity and smoothness, GD converges linearly and reaches the final desired accuracy in a logarithmic number of iterations. In contrast, in the low SNR setting, where the problem becomes locally convex, GD converges at a slower rate and requires a polynomial number of iterations to reach the desired accuracy. Even though Newton's method can be used to resolve the flat curvature of the loss functions in the low SNR case, its computational cost is prohibitive in high-dimensional settings as it is $$\mathcal{O}(d^3)$$, where $$d$$ the is the problem dimension. To address the shortcomings of GD and Newton's method, we propose the use of the BFGS quasi-Newton method to solve parameter estimation of the GLMs, which has a per iteration cost of $$\mathcal{O}(d^2)$$. When the SNR is low, for GLMs with a polynomial link function of degree $$p$$, we demonstrate that the iterates of BFGS converge linearly to the optimal solution of the population least-square loss function, and the contraction coefficient of the BFGS algorithm is comparable to that of Newton's method. Moreover, the contraction factor of the linear rate is independent of problem parameters and only depends on the degree of the link function $$p$$. Also, for the empirical loss with $$n$$ samples, we prove that in the low SNR setting of GLMs with a polynomial link function of degree $$p$$, the iterates of BFGS reach a final statistical radius of $$\mathcal{O}((d/n)^{\frac{1}{2p+2}})$$ after at most $$\log(n/d)$$ iterations. This complexity is significantly less than the number required for GD, which scales polynomially with $(n/d)$.more » « less
-
We propose MaxUp, an embarrassingly simple, highly effective technique for improving the generalization performance of machine learning models, especially deep neural networks. The idea is to generate a set of augmented data with some random perturbations or transforms and minimize the maximum, or worst case loss over the augmented data. By doing so, we implicitly introduce a smoothness or robustness regularization against the random perturbations, and hence improve the generation performance. For example, in the case of Gaussian perturbation, MaxUp is asymptotically equivalent to using the gradient norm of the loss as a penalty to encourage smoothness. We test MaxUp on a range of tasks, including image classification, language modeling, and adversarial certification, on which MaxUp consistently outperforms the existing best baseline methods, without introducing substantial computational overhead. In particular, we improve ImageNet classification from the state-of-the-art top-1 accuracy 85.5% without extra data to 85.8%. Code will be released soon.more » « less
-
We propose a new Stein self-repulsive dynamics for obtaining diversified samples from intractable un-normalized distributions. Our idea is to introduce Stein variational gradient as a repulsive force to push the samples of Langevin dynamics away from the past trajectories. This simple idea allows us to significantly decrease the auto-correlation in Langevin dynamics and hence increase the effective sample size. Importantly, as we establish in our theoretical analysis, the asymptotic stationary distribution remains correct even with the addition of the repulsive force, thanks to the special properties of the Stein variational gradient. We perform extensive empirical studies of our new algorithm, showing that our method yields much higher sample efficiency and better uncertainty estimation than vanilla Langevin dynamics.more » « less
An official website of the United States government

Full Text Available